{
 "metadata": {
  "name": "",
  "signature": "sha256:38ff0a946774364faf51a5cd9aef29290a641b4c3a02f6149920c93a27bd4f97"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Decision Trees"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "*By Parijat Mazumdar (GitHub ID: [mazumdarparijat](https://github.com/mazumdarparijat))*"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This notebook illustrates the use of [decision trees](http://en.wikipedia.org/wiki/Decision_tree) in Shogun for classification and regression. Various [decision tree learning](http://en.wikipedia.org/wiki/Decision_tree_learning) algorithms like [ID3](http://en.wikipedia.org/wiki/ID3_algorithm), [C4.5](http://en.wikipedia.org/wiki/C4.5_algorithm), [CART](http://themainstreamseer.blogspot.in/2013/01/introduction-to-classification.html), [CHAID](http://en.wikipedia.org/wiki/CHAID) have been discussed in detail using both intuitive toy datasets as well as real-world datasets."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Decision Tree Basics"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Decision Trees are a non-parametric supervised learning method that can be used for both classification and regression. Decision trees essentially encode a set of if-then-else rules which can be used to predict target variable given data features. These if-then-else rules are formed using the training dataset with the aim to satisfy as many training data instances as possible. The formation of these rules (aka. decision tree) from training data is called decision tree learning. Various decision tree learning algorithms have been developed and they work best in different situations. An advantage of decision trees is that they can model any type of function for classification or regression which other techniques cannot. But a decision tree is highly prone to overfitting and bias towards training data. So, decision trees are used for very large datasets which are assumed to represent the ground truth well. Additionally, certain [tree pruning algorithms](http://en.wikipedia.org/wiki/Pruning_%28decision_trees%29) are also used to tackle overfitting.   "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "ID3 (Iterative Dichotomiser 3)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "ID3 is a straightforward decision tree learning algorithm developed by [Ross Quinlan](http://en.wikipedia.org/wiki/Ross_Quinlan). ID3 is applicable only in cases where the attributes (or features) defining data examples are categorical in nature and the data examples belong to pre-defined, clearly distinguishable (ie. well defined) classes. ID3 is an iterative greedy algorithm which starts with the root node and eventually builds the entire tree. At each node, the \"best\" attribute to classify data is chosen. The \"best\" attribute is chosen using the [information gain metric](http://en.wikipedia.org/wiki/Information_gain_in_decision_trees). Once an attribute is chosen in a node, the data examples in the node are categorized into sub-groups based on the attribute values that they have. Basically, all data examples having the same attribute value are put together in the same sub-group. These sub-groups form the children of the present node and the algorithm is repeated for each of the newly formed children nodes. This goes on until all the data members of a node belong to the same class or all the attributes are exhausted. In the latter case, the class predicted may be erroneous and generally the mode of the classes appearing in the node is chosen as the predictive class. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Pseudocode for ID3 Algorithm"
     ]
    },
    {
     "cell_type": "raw",
     "metadata": {},
     "source": [
      "\tInputs: (D, A, C), where:\n",
      "\t\tD is a dataset with only nominal instance attributes A \n",
      "\t\tC is the class attribute\n",
      "\n",
      "\tOutput: a decision tree T representing a sequential decision process for\n",
      "\tclassifying instances (predicting the values of the class attribute C);\n",
      "\teach node of T is labeled with a non-class attribute of A\n",
      "\n",
      "\tInformal Inductive Bias: minimize the average height of the tree\n",
      "\n",
      "\tProcedure:\n",
      "\n",
      "\tif the set of remaining non-class attributes is empty \n",
      "\tor if all of the instances in D are in the same class\n",
      "    {\n",
      "\t\treturn an empty tree\n",
      "    }\n",
      "\telse \n",
      "    {\n",
      "\t\tcompute the Information Gain for each attribute over the dataset D\n",
      "\t\t\tlet a* be the attribute with maximum Information Gain\t\n",
      "\n",
      "\t\tcreate a root node for a tree T; label it with a* \n",
      "\n",
      "\t\tfor each value b of attribute a* \n",
      "        {\n",
      "\t\t\tlet T(a*=b) be the tree computed recursively by ID3 on input (D|a*=b, A-a*, C), where:\n",
      "\t\t\t\tD|a*=b contains all instances of D for which a* has the value b\n",
      "\t\t\t\tA-a* consists of all attributes of A except a*\n",
      "\n",
      "\t\t\tattach T(a*=b) to the root of T as a subtree\n",
      "\t\t}\n",
      "\t\t\n",
      "\t\treturn the resulting decision tree T\n",
      "\t}"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Example using a Simple dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we create a simple example where we try to predict the usage of mobile phones by individuals based on their income, age, education and marital status. Each of the attributes have been categorized into 2 or 3 types. Let us create the training dataset and tabulate it first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# training data\n",
      "train_income=['Low','Medium','Low','High','Low','High','Medium','Medium','High','Low','Medium',\n",
      "'Medium','High','Low','Medium']\n",
      "\n",
      "train_age = ['Old','Young','Old','Young','Old','Young','Young','Old','Old','Old','Young','Old',\n",
      "'Old','Old','Young']\n",
      "\n",
      "train_education = ['University','College','University','University','University','College','College',\n",
      "'High School','University','High School','College','High School','University','High School','College']\n",
      "\n",
      "train_marital = ['Married','Single','Married','Single','Married','Single','Married','Single','Single',\n",
      "'Married','Married','Single','Single','Married','Married']\n",
      "\n",
      "train_usage = ['Low','Medium','Low','High','Low','Medium','Medium','Low','High','Low','Medium','Low',\n",
      "'High','Low','Medium']\n",
      "\n",
      "# print data\n",
      "print 'Training Data Table : \\n'\n",
      "print 'Income \\t\\t Age \\t\\t Education \\t\\t Marital Status \\t Usage'\n",
      "for i in xrange(len(train_income)):\n",
      "\tprint train_income[i]+' \\t\\t '+train_age[i]+' \\t\\t '+train_education[i]+' \\t\\t '+train_marital[i]+' \\t\\t '+train_usage[i]\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We want to create a decision tree from the above training dataset. The first step for that is to encode the data into numeric values and bind them to Shogun's features and multiclass labels."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import ID3ClassifierTree, RealFeatures, MulticlassLabels\n",
      "from numpy import array, concatenate\n",
      "\n",
      "# encoding dictionary\n",
      "income = {'Low' : 1.0, 'Medium' : 2.0, 'High' : 3.0}\n",
      "age = {'Young' : 1.0, 'Old' : 2.0}\n",
      "education = {'High School' : 1.0, 'College' : 2.0, 'University' : 3.0}\n",
      "marital_status = {'Married' : 1.0, 'Single' : 2.0}\n",
      "usage = {'Low' : 1.0, 'Medium' : 2.0, 'High' : 3.0}\n",
      "\n",
      "\n",
      "# encode training data\n",
      "for i in xrange(len(train_income)):\n",
      "\ttrain_income[i] = income[train_income[i]]\n",
      "\ttrain_age[i] = age[train_age[i]]\n",
      "\ttrain_education[i] = education[train_education[i]]\n",
      "\ttrain_marital[i] = marital_status[train_marital[i]]\n",
      "\ttrain_usage[i] = usage[train_usage[i]]\n",
      "    \n",
      "# form Shogun feature matrix\n",
      "train_data = array([train_income, train_age, train_education, train_marital])\n",
      "train_feats = RealFeatures(train_data);\n",
      "\n",
      "# form Shogun multiclass labels\n",
      "labels = MulticlassLabels(array(train_usage));"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we learn our decision tree using the features and labels created."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# create ID3ClassifierTree object\n",
      "id3 = ID3ClassifierTree()\n",
      "\n",
      "# set labels\n",
      "id3.set_labels(labels)\n",
      "\n",
      "# learn the tree from training features\n",
      "is_successful = id3.train(train_feats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Our decision tree is ready now and we want to use it to make some predictions over test data. So, let us create some test data examples first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# test data\n",
      "test_income = ['Medium','Medium','Low','High','High']\n",
      "test_age = ['Old','Young','Old','Young','Old']\n",
      "test_education = ['University','College','High School','University','College']\n",
      "test_marital = ['Married','Single','Married','Single','Married']\n",
      "test_usage = ['Low','Medium','Low','High','High']\n",
      "\n",
      "# tabulate test data\n",
      "print 'Test Data Table : \\n'\n",
      "print 'Income \\t\\t Age \\t\\t Education \\t\\t Marital Status \\t Usage'\n",
      "for i in xrange(len(test_income)):\n",
      "\tprint test_income[i]+' \\t\\t '+test_age[i]+' \\t\\t '+test_education[i]+' \\t\\t '+test_marital[i]+' \\t\\t ?'\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, as with training data, we encode our test dataset and bind it to Shogun features. Then, we apply our decision tree to the test examples to obtain the predicted labels."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# encode test data\n",
      "for i in xrange(len(test_income)):\n",
      "\ttest_income[i] = income[test_income[i]]\n",
      "\ttest_age[i] = age[test_age[i]]\n",
      "\ttest_education[i] = education[test_education[i]]\n",
      "\ttest_marital[i] = marital_status[test_marital[i]]\n",
      "\n",
      "# bind to shogun features    \n",
      "test_data = array([test_income, test_age, test_education, test_marital])\n",
      "test_feats = RealFeatures(test_data)\n",
      "\n",
      "# apply decision tree classification\n",
      "test_labels = id3.apply_multiclass(test_feats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally let us tabulate the results obtained and compare them with our intuitive predictions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "output = test_labels.get_labels();\n",
      "output_labels=[0]*len(output)\n",
      "\n",
      "# decode back test data for printing\n",
      "for i in xrange(len(test_income)):\n",
      "\ttest_income[i]=income.keys()[income.values().index(test_income[i])]\n",
      "\ttest_age[i]=age.keys()[age.values().index(test_age[i])]\n",
      "\ttest_education[i]=education.keys()[education.values().index(test_education[i])]\n",
      "\ttest_marital[i]=marital_status.keys()[marital_status.values().index(test_marital[i])]\n",
      "\toutput_labels[i]=usage.keys()[usage.values().index(output[i])]\n",
      "\n",
      "# print output data\n",
      "print 'Final Test Data Table : \\n'\n",
      "print 'Income \\t Age \\t Education \\t Marital Status \\t Usage(predicted)'\n",
      "for i in xrange(len(test_income)):\n",
      "\tprint test_income[i]+' \\t '+test_age[i]+' \\t '+test_education[i]+' \\t '+test_marital[i]+' \\t\\t '+output_labels[i]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So, do the predictions made by our decision tree match our inferences from training set? Yes! For example, from the training set we infer that the individual having low income has low usage and also all individuals going to college have medium usage. The decision tree predicts the same for both cases. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Example using a real dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We choose the [car evaluation dataset](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.html) as our real-world dataset. The [car.names](https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.c45-names) file of the dataset enumerates the class categories as well as the non-class attributes. Each car categorized into one of 4 classes : unacc, acc, good, vgood. Each car is judged using 6 attributes : buying, maint, doors, persons, lug_boot, safety. Each of these attributes can take 3-4 values. Let us first make a dictionary to encode strings to numeric values using information from cars.names file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# class attribute\n",
      "evaluation = {'unacc' : 1.0, 'acc' : 2.0, 'good' : 3.0, 'vgood' : 4.0}\n",
      "\n",
      "# non-class attributes\n",
      "buying = {'vhigh' : 1.0, 'high' : 2.0, 'med' : 3.0, 'low' : 4.0}\n",
      "maint = {'vhigh' : 1.0, 'high' : 2.0, 'med' : 3.0, 'low' : 4.0}\n",
      "doors = {'2' : 1.0, '3' : 2.0, '4' : 3.0, '5more' : 4.0}\n",
      "persons = {'2' : 1.0, '4' : 2.0, 'more' : 3.0}\n",
      "lug_boot = {'small' : 1.0, 'med' : 2.0, 'big' : 3.0}\n",
      "safety = {'low' : 1.0, 'med' : 2.0, 'high' : 3.0}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, let us read the file and form Shogun features and labels."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = open('../../../../data/uci/car/car.data', 'r')\n",
      "\n",
      "features = []\n",
      "labels = []\n",
      "\n",
      "# read data from file and encode\n",
      "for line in f:\n",
      "    words = line.rstrip().split(',')\n",
      "    words[0] = buying[words[0]]\n",
      "    words[1] = maint[words[1]]\n",
      "    words[2] = doors[words[2]]\n",
      "    words[3] = persons[words[3]]\n",
      "    words[4] = lug_boot[words[4]]\n",
      "    words[5] = safety[words[5]]\n",
      "    words[6] = evaluation[words[6]]\n",
      "    features.append(words[0:6])\n",
      "    labels.append(words[6])\n",
      "\n",
      "f.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the entire dataset, let us choose some test vectors to form our test dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import random, delete\n",
      "\n",
      "features = array(features)\n",
      "labels = array(labels)\n",
      "\n",
      "# number of test vectors\n",
      "num_test_vectors = 200;\n",
      "\n",
      "test_indices = random.randint(features.shape[0], size = num_test_vectors)\n",
      "test_features = features[test_indices]\n",
      "test_labels = labels[test_indices]\n",
      "\n",
      "# remove test vectors from training set\n",
      "features = delete(features,test_indices,0)\n",
      "labels = delete(labels,test_indices,0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next step is to train our decision tree using the training features and applying it to our test dataset to get predicted output classes."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# shogun test features and labels\n",
      "test_feats = RealFeatures(test_features.T)\n",
      "test_labels = MulticlassLabels(test_labels)\n",
      "\n",
      "# method for id3 training and\n",
      "def ID3_routine(features, labels):\n",
      "\n",
      "    # Shogun train features and labels\n",
      "    train_feats = RealFeatures(features.T)\n",
      "    train_lab = MulticlassLabels(labels)\n",
      "\n",
      "    # create ID3ClassifierTree object\n",
      "    id3 = ID3ClassifierTree()\n",
      "\n",
      "    # set labels\n",
      "    id3.set_labels(train_lab)\n",
      "\n",
      "    # learn the tree from training features\n",
      "    id3.train(train_feats)\n",
      "\n",
      "    # apply to test dataset\n",
      "    output = id3.apply_multiclass(test_feats)\n",
      "    \n",
      "    return output\n",
      "\n",
      "output = ID3_routine(features, labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, let us compare our predicted labels with test labels to find out the percentage error of our classification model."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import MulticlassAccuracy\n",
      "\n",
      "# Shogun object for calculating multiclass accuracy\n",
      "accuracy = MulticlassAccuracy()\n",
      "print 'Accuracy : ' + str(accuracy.evaluate(output, test_labels))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the accuracy is moderately high. Thus our decision tree can evaluate any car given its features with a high success rate. As a final exercise, let us examine the effect of training dataset size on the accuracy of decision tree."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# list of error rates for all training dataset sizes\n",
      "error_rate = []\n",
      "\n",
      "# number of error rate readings taken for each value of dataset size\n",
      "num_repetitions = 3\n",
      "\n",
      "# loop over training dataset size\n",
      "for i in range(500,1600,200):\n",
      "    indices = random.randint(features.shape[0], size = i)\n",
      "    train_features = features[indices]\n",
      "    train_labels = labels[indices]\n",
      "    \n",
      "    average_error = 0\n",
      "    for i in xrange(num_repetitions):\n",
      "        output = ID3_routine(train_features, train_labels)\n",
      "        average_error = average_error + (1-accuracy.evaluate(output, test_labels))\n",
      "        \n",
      "    error_rate.append(average_error/num_repetitions)\n",
      "\n",
      "# plot the error rates    \n",
      "import matplotlib.pyplot as pyplot\n",
      "% matplotlib inline\n",
      "from scipy.interpolate import interp1d\n",
      "from numpy import linspace, arange\n",
      "\n",
      "fig,axis = pyplot.subplots(1,1)\n",
      "x = arange(500,1600,200)\n",
      "f = interp1d(x, error_rate)\n",
      "\n",
      "xnew = linspace(500,1500,100)\n",
      "pyplot.plot(x,error_rate,'o',xnew,f(xnew),'-')\n",
      "pyplot.xlim([400,1600])\n",
      "pyplot.xlabel('training dataset size')\n",
      "pyplot.ylabel('Classification Error')\n",
      "pyplot.title('Decision Tree Performance')\n",
      "pyplot.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 6,
     "metadata": {},
     "source": [
      "NOTE : The above code snippet takes about half a minute to execute. Please wait patiently."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the above plot, we see that error rate decreases steadily as we increase the training dataset size. Although in this case, the decrease in error rate is not very significant, in many datasets this decrease in error rate can be substantial."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "C4.5"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The C4.5 algorithm is essentially an extension of the ID3 algorithm for decision tree learning. It has the additional capability of handling continuous attributes and attributes with missing values. The tree growing process in case of C4.5 is same as that of ID3 i.e. finding the best split at each node using the [information gain metric](http://en.wikipedia.org/wiki/Information_gain_in_decision_trees). But in case of continuous attribute, the C4.5 algorithm has to perform the additional step of converting it to a two-value categorical attribute by splitting about a suitable threshold. This threshold is chosen in a way such that the resultant split produces maximum information gain. Let us start exploring Shogun's C4.5 algorithm API with a toy example. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Example using toy dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us consider a 3-class classification using 2 attributes. One of the attributes (say attribute X) is a 2-class categorical attribute depicted by values 1 and 2. The other attribute (say attribute Y) is a continuous attribute having values between 1 and 9. The simple rules of classification are as follows : if X=1 and Y $\\epsilon$ [1,5), data point belongs to class 1, if X=1 and Y $\\epsilon$ [5,9), data point belongs to class 2 and if X=2, data point belongs to class 3. Let us realize the toy dataset by plotting it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "from numpy import ones, zeros, random, concatenate\n",
      "from modshogun import RealFeatures, MulticlassLabels\n",
      "% matplotlib inline\n",
      "\n",
      "def create_toy_classification_dataset(ncat,do_plot):\n",
      "    # create attribute values and labels for class 1\n",
      "    x = ones((1,ncat))\n",
      "    y = 1+random.rand(1,ncat)*4\n",
      "    lab = zeros(ncat)\n",
      "\n",
      "    # add attribute values and labels for class 2\n",
      "    x = concatenate((x,ones((1,ncat))),1)\n",
      "    y = concatenate((y,5+random.rand(1,ncat)*4),1)\n",
      "    lab = concatenate((lab,ones(ncat)))\n",
      "\n",
      "    # add attribute values and labels for class 3\n",
      "    x = concatenate((x,2*ones((1,ncat))),1)\n",
      "    y = concatenate((y,1+random.rand(1,ncat)*8),1)\n",
      "    lab = concatenate((lab,2*ones(ncat)))\n",
      "\n",
      "    # create test data\n",
      "    ntest = 20\n",
      "    x_t = concatenate((ones((1,3*ntest/4)),2*ones((1,ntest/4))),1)\n",
      "    y_t = 1+random.rand(1,ntest)*8\n",
      "\n",
      "    if do_plot:\n",
      "        # plot training data\n",
      "        c = ['r','g','b']\n",
      "        for i in range(3):\n",
      "            plt.scatter(x[0,lab==i],y[0,lab==i],color=c[i],marker='x',s=50)\n",
      "\n",
      "        # plot test data\n",
      "        plt.scatter(x_t[0,:],y_t[0,:],color='k',s=10,alpha=0.8)\n",
      "\n",
      "        plt.xlabel('attribute X')\n",
      "        plt.ylabel('attribute Y')\n",
      "        plt.show()\n",
      "\n",
      "    # form training feature matrix\n",
      "    train_feats = RealFeatures(concatenate((x,y),0))\n",
      "\n",
      "    # from training labels\n",
      "    train_labels = MulticlassLabels(lab)\n",
      "\n",
      "    # from test feature matrix\n",
      "    test_feats = RealFeatures(concatenate((x_t,y_t),0))\n",
      "    \n",
      "    return (train_feats,train_labels,test_feats);\n",
      "\n",
      "train_feats,train_labels,test_feats = create_toy_classification_dataset(20,True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above plot the training data points are are marked with different colours of crosses where each colour corresponds to a particular label. The test data points are marked by black circles. For us it is a trivial task to assign correct colours (i.e. labels) to the black points. Let us see how accurately C4.5 assigns colours to these test points."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let us train a decision tree using the C4.5 algorithm. We need to create a Shogun C4.5 tree object and supply training features and training labels to it. We also need to specify which attribute is categorical and which is continuous. The attribute types can be specified using [set_feature_types](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CC45ClassifierTree.html#a6fed413e6eee6a4d798613ea34d14512) method through which all categorical attributes are set as True and continuous attributes as False. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import array\n",
      "from modshogun import C45ClassifierTree\n",
      "\n",
      "# steps in C4.5 Tree training bundled together in a python method\n",
      "def train_tree(feats,types,labels):\n",
      "    # C4.5 Tree object\n",
      "    tree = C45ClassifierTree()\n",
      "    # set labels\n",
      "    tree.set_labels(labels)\n",
      "    # supply attribute types\n",
      "    tree.set_feature_types(types)\n",
      "    # supply training matrix and train\n",
      "    tree.train(feats)\n",
      "    \n",
      "    return tree\n",
      "\n",
      "# specify attribute types X is categorical hence True, Y is continuous hence False\n",
      "feat_types = array([True,False])\n",
      "\n",
      "# get back trained tree\n",
      "C45Tree = train_tree(train_feats,feat_types,train_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that we have trained the decision tree, we can use it to classify our test vectors."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def classify_data(tree,data):\n",
      "    # get classification labels\n",
      "    output = tree.apply_multiclass(data)   \n",
      "    # get classification certainty\n",
      "    output_certainty=tree.get_certainty_vector()\n",
      "    \n",
      "    return output,output_certainty\n",
      "\n",
      "out_labels,out_certainty = classify_data(C45Tree,test_feats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us use the output labels to colour our test data points to qualitatively judge the performance of the decision tree.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import int32\n",
      "\n",
      "# plot results\n",
      "def plot_toy_classification_results(train_feats,train_labels,test_feats,test_labels):\n",
      "    train = train_feats.get_feature_matrix()\n",
      "    lab = train_labels.get_labels()\n",
      "    test = test_feats.get_feature_matrix()\n",
      "    out_labels = test_labels.get_labels()\n",
      "    \n",
      "    c = ['r','g','b']\n",
      "    for i in range(out_labels.size):\n",
      "        plt.scatter(test[0,i],test[1,i],color=c[int32(out_labels[i])],s=50)\n",
      "\n",
      "    # plot training dataset for visual comparison\n",
      "    for i in range(3):\n",
      "        plt.scatter(train[0,lab==i],train[1,lab==i],color=c[i],marker='x',s=30,alpha=0.7)\n",
      "\n",
      "    plt.show()\n",
      "    \n",
      "plot_toy_classification_results(train_feats,train_labels,test_feats,out_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see that the decision tree trained using the C4.5 algorithm works almost perfectly in this toy dataset. Now let us try this algorithm on a real world dataset. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Example using a real dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section we will investigate that how accurately we can predict the species of an Iris flower using a C4.5 trained decision tree. In this example we will use petal length, petal width, sepal length and sepal width as our attributes to decide among 3 classes of Iris : Iris Setosa, Iris Versicolor and Iris Verginica. Let us start by suitably reading the dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv\n",
      "from numpy import array\n",
      "\n",
      "# dictionary to encode class names to class labels\n",
      "to_label = {'Iris-setosa' : 0.0, 'Iris-versicolor' : 1.0, 'Iris-virginica' : 2.0}\n",
      "\n",
      "# read csv file and separate out labels and features\n",
      "lab = []\n",
      "feat = []\n",
      "with open('../../../../data/uci/iris/iris.data') as csvfile:\n",
      "    csvread = csv.reader(csvfile,delimiter=',')\n",
      "    for row in csvread:\n",
      "        feat.append([float(i) for i in row[0:4]])\n",
      "        lab.append(to_label[row[4]])\n",
      "\n",
      "lab = array(lab)\n",
      "feat = array(feat).T"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because there is no separate test dataset, we first divide the given dataset into training and testing subsets."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import int32, random\n",
      "\n",
      "# no.of vectors in test dataset\n",
      "ntest = 25\n",
      "# no. of vectors in train dataset\n",
      "ntrain = 150-ntest\n",
      "\n",
      "# randomize the order of vectors\n",
      "subset = int32(random.permutation(150))\n",
      "\n",
      "# choose 1st ntrain from randomized set as training vectors\n",
      "feats_train = feat[:,subset[0:ntrain]]\n",
      "# form training labels correspondingly\n",
      "train_labels = lab[subset[0:ntrain]]\n",
      "\n",
      "# form test features and labels (for accuracy evaluations)\n",
      "feats_test = feat[:,subset[ntrain:ntrain+ntest]]\n",
      "test_labels = lab[subset[ntrain:ntrain+ntest]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Before marching forward with applying C4.5, let us plot the data to get a better understanding. The given data points are 4-D and hence cannot be conveniently plotted. We need to reduce the number of dimensions to 2. This reduction can be achieved using any dimension reduction algorithm like PCA. However for the sake of brevity, let us just choose two highly correlated dimensions, petal width and petal length (see [summary statistics](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)), right away for plotting.   "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "% matplotlib inline\n",
      "\n",
      "# plot training features\n",
      "c = ['r', 'g', 'b']\n",
      "for i in range(3):\n",
      "    plt.scatter(feats_train[2,train_labels==i],feats_train[3,train_labels==i],color=c[i],marker='x')\n",
      "\n",
      "# plot test data points in black\n",
      "plt.scatter(feats_test[2,:],feats_test[3,:],color='k',marker='o')\n",
      "\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, let us create Shogun features and labels from the given data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import RealFeatures, MulticlassLabels\n",
      "\n",
      "# training data\n",
      "feats_train = RealFeatures(feats_train)\n",
      "train_labels = MulticlassLabels(train_labels)\n",
      "\n",
      "# test data\n",
      "feats_test = RealFeatures(feats_test)\n",
      "test_labels = MulticlassLabels(test_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We know for fact that decision trees tend to overfit. Hence pruning becomes a necessary step. In case of toy dataset, we skipped the pruning step because the dataset was simple and noise-free. But in case of a real dataset like the Iris dataset pruning cannot be skipped. So we have to partition the training dataset into the training subset and the validation subset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# randomize the order of vectors\n",
      "subset = int32(random.permutation(ntrain))\n",
      "\n",
      "nvalidation = 45\n",
      "\n",
      "# form training subset and validation subset\n",
      "train_subset = subset[0:ntrain-nvalidation]\n",
      "validation_subset = subset[ntrain-nvalidation:ntrain]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we train the decision tree first, then prune it and finally use it to get output labels for test vectors. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set attribute types - all continuous\n",
      "feature_types = array([False, False, False, False])\n",
      "\n",
      "# remove validation subset before training the tree\n",
      "feats_train.add_subset(train_subset)\n",
      "train_labels.add_subset(train_subset)\n",
      "\n",
      "# train tree\n",
      "C45Tree = train_tree(feats_train,feature_types,train_labels)\n",
      "\n",
      "# bring back validation subset\n",
      "feats_train.remove_subset()\n",
      "train_labels.remove_subset()\n",
      "\n",
      "# remove data belonging to training subset\n",
      "feats_train.add_subset(validation_subset)\n",
      "train_labels.add_subset(validation_subset)\n",
      "\n",
      "# prune the tree\n",
      "C45Tree.prune_tree(feats_train,train_labels)\n",
      "\n",
      "# bring back training subset\n",
      "feats_train.remove_subset()\n",
      "train_labels.remove_subset()\n",
      "\n",
      "# get results\n",
      "output, output_certainty = classify_data(C45Tree,feats_test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us calculate the accuracy of the classification made by our tree as well as plot the results for qualitative evaluation."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import MulticlassAccuracy\n",
      "\n",
      "# Shogun object for calculating multiclass accuracy\n",
      "accuracy = MulticlassAccuracy()\n",
      "print 'Accuracy : ' + str(accuracy.evaluate(output, test_labels))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# convert MulticlassLabels object to labels vector\n",
      "output = output.get_labels()\n",
      "test_labels = test_labels.get_labels()\n",
      "train_labels = train_labels.get_labels()\n",
      "\n",
      "# convert RealFeatures object to matrix\n",
      "feats_test = feats_test.get_feature_matrix()\n",
      "feats_train = feats_train.get_feature_matrix()\n",
      "\n",
      "# plot ground truth\n",
      "for i in range(3):\n",
      "    plt.scatter(feats_test[2,test_labels==i],feats_test[3,test_labels==i],color=c[i],marker='x',s=100)\n",
      "\n",
      "# plot predicted labels\n",
      "for i in range(output.size):\n",
      "    plt.scatter(feats_test[2,i],feats_test[3,i],color=c[int32(output[i])],marker='o',s=30*output_certainty[i])\n",
      "        \n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the evaluation of results, we infer that, with the help of a C4.5 trained decision tree, we can predict (with high accuracy) the type of Iris plant given its petal and sepal widths and lengths."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Classification and Regression Trees (CART)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The CART algorithm is a popular decision tree learning algorithm introduced by Breiman et al. Unlike ID3 and C4.5, the learnt decision tree in this case can be used for both multiclass classification and regression depending on the type of dependent variable. The tree growing process comprises of recursive binary splitting of nodes. To find the best split at each node, all possible splits of all available predictive attributes are considered. The best split is the one that maximises some splitting criterion. For classification tasks, ie. when the dependent attribute is categorical, the [Gini index](http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) is used as the splitting criterion. For regression tasks, ie. when the dependent variable is continuous, the [least squares deviation](http://en.wikipedia.org/wiki/Least_squares) is used. Let us learn about [Shogun's CART implementation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCARTree.html) by working on two toy problems, one on classification and the other on regression."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Classification example using toy data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us consider the same dataset as that in the C4.5 toy example. We re-create the dataset and plot it first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_feats,train_labels,test_feats=create_toy_classification_dataset(20,True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we supply necessary parameters to the CART algorithm and use it train our decision tree. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import PT_MULTICLASS, CARTree\n",
      "from numpy import array\n",
      "\n",
      "def train_carttree(feat_types,problem_type,num_folds,use_cv_pruning,labels,features):\n",
      "    # create CART tree object\n",
      "    c = CARTree(feat_types,problem_type,num_folds,use_cv_pruning)\n",
      "    # set training labels\n",
      "    c.set_labels(labels)\n",
      "    # train using training features\n",
      "    c.train(features)\n",
      "    \n",
      "    return c\n",
      "\n",
      "# form feature types True for nominal (attribute X), False for ordinal/continuous (attribute Y)\n",
      "ft = array([True, False])\n",
      "\n",
      "# get back trained tree\n",
      "cart = train_carttree(ft, PT_MULTICLASS, 5, True, train_labels, train_feats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the above code snippet, we see four parameters being supplied to the CART tree object. ```feat_types``` supplies knowledge of attribute types of training data to the CART algorithm and ```problem_type``` specifies whether it is a multiclass classification problem (```PT_MULTICLASS```) or a regression problem (```PT_REGRESSION```). The boolean parameter ```use_cv_pruning``` switches on cross-validation pruning  of the trained tree and ```num_folds``` specifies the number of folds of cross-validation to be applied while pruning. At this point, let us divert ourselves briefly towards undertanding what kind of pruning strategy is employed by Shogun's CART implementation. The CART algorithm uses the cost-complexity pruning strategy. Cost-Complexity pruning yields a list of subtrees of varying depths using complexity normalized resubstitution error, $R_\\alpha(T)$. Resubstitution error, R(T), measures how well a decision tree fits the training data. But, this measure favours larger trees over smaller ones. Hence the complexity normalized resubstitution error metric is used which adds penalty for increased complexity and in-turn counters overfitting.\n",
      "\n",
      "$R_\\alpha(T)=R(T)+\\alpha \\times (numleaves)$\n",
      "\n",
      "The best subtree among the list of subtrees can be chosen using cross validation or using the best-fit metric in the validation dataset. Setting ```use_cv_pruning``` in the above code snippet basically tells the CART object to use cross-validation to choose the best among the subtrees generated by cost-complexity pruning.\n",
      "\n",
      "Let us now get back on track and use the trained tree to classify our test data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import int32\n",
      "\n",
      "# get output labels\n",
      "output_labels = cart.apply_multiclass(test_feats)\n",
      "\n",
      "plot_toy_classification_results(train_feats,train_labels,test_feats,output_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Regression example using toy data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this example, we form the training dataset by sampling points from a sinusoidal curve and see how well a decision tree, trained using these samples, re-creates the actual sinusoid. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import RegressionLabels, RealFeatures\n",
      "from numpy import random, sin, linspace\n",
      "import matplotlib.pyplot as plt\n",
      "% matplotlib inline\n",
      "\n",
      "def create_toy_regression_dataset(nsamples,noise_var):\n",
      "    # randomly choose positions in X axis between 0 to 16\n",
      "    samples_x = random.rand(1,nsamples)*16\n",
      "\n",
      "    # find out y (=sin(x)) values for the sampled x positions and add noise to it\n",
      "    samples_y = sin(samples_x)+(random.rand(1,nsamples)-0.5)*noise_var\n",
      "\n",
      "    # plot the samples\n",
      "    plt.scatter(samples_x,samples_y,color='b',marker='x')\n",
      "    \n",
      "    # create training features\n",
      "    train_feats = RealFeatures(samples_x)\n",
      "    # training labels\n",
      "    train_labels = RegressionLabels(samples_y[0,:])\n",
      "\n",
      "    return (train_feats,train_labels)\n",
      "\n",
      "# plot the reference sinusoid\n",
      "def plot_ref_sinusoid():\n",
      "    plot_x = linspace(-2,18,100)\n",
      "    plt.plot(plot_x,sin(plot_x),color='y',linewidth=1.5)\n",
      "    plt.xlabel('Feature values')\n",
      "    plt.ylabel('Labels')\n",
      "    plt.xlim([-3,19])\n",
      "    plt.ylim([-1.5,1.5])\n",
      "\n",
      "# number of samples is 300, noise variance is 0.5\n",
      "train_feats,train_labels = create_toy_regression_dataset(300,0.5)\n",
      "\n",
      "plot_ref_sinusoid()\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we train our CART-tree."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import PT_REGRESSION\n",
      "from numpy import array\n",
      "\n",
      "# feature type - continuous\n",
      "feat_type = array([False])\n",
      "\n",
      "# get back trained tree\n",
      "cart = train_carttree(feat_type, PT_REGRESSION, 5, True, train_labels, train_feats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let us use the trained decision tree to regress over the entire range of the previously depicted sinusoid."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def plot_predicted_sinusoid(cart):\n",
      "    # regression range - 0 to 16\n",
      "    x_test = array([linspace(0,16,100)])\n",
      "\n",
      "    # form Shogun features\n",
      "    test_feats = RealFeatures(x_test)\n",
      "\n",
      "    # apply regression using our previously trained CART-tree\n",
      "    regression_output = cart.apply_regression(test_feats).get_labels()\n",
      "\n",
      "    # plot the result\n",
      "    plt.plot(x_test[0,:],regression_output,linewidth=2.0)\n",
      "\n",
      "    # plot reference sinusoid\n",
      "    plot_ref_sinusoid()\n",
      "\n",
      "    plt.show()\n",
      "    \n",
      "plot_predicted_sinusoid(cart)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we can see from the above plot, CART-induced decision tree follows the reference sinusoid quite beautifully!"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Classification example using real dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we will apply the CART algorithm on the Iris dataset. Remember that the Iris dataset provides us with just a training dataset and no separate test dataset. In case of the C4.5 example discussed earlier, we ourselves divided the entire training dataset into training subset and test subset. In this section, we will employ a different strategy i.e. cross validation. In cross-validation, we divide the training dataset into ```n``` subsets where ```n``` is a user controlled parameter. We perform ```n``` iterations of training and testing in which, at each iteration, we choose one of the ```n``` subsets as our test dataset and the remaining ```n-1``` subsets as our training dataset. The performance of the model is usually taken as the average of the performances in various iterations. [Shogun's cross validation class](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCrossValidation.html) makes it really easy to apply cross-validation to any model of our choice. Let us realize this by applying cross-validation to CART-tree trained over Iris dataset. We start by reading the data. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv\n",
      "from numpy import array\n",
      "import matplotlib.pylab as plt \n",
      "% matplotlib inline\n",
      "\n",
      "# dictionary to encode class names to class labels\n",
      "to_label = {'Iris-setosa' : 0.0, 'Iris-versicolor' : 1.0, 'Iris-virginica' : 2.0}\n",
      "\n",
      "# read csv file and separate out labels and features\n",
      "lab = []\n",
      "feat = []\n",
      "with open('../../../../data/uci/iris/iris.data') as csvfile:\n",
      "    csvread = csv.reader(csvfile,delimiter=',')\n",
      "    for row in csvread:\n",
      "        feat.append([float(i) for i in row[0:4]])\n",
      "        lab.append(to_label[row[4]])\n",
      "\n",
      "lab = array(lab)\n",
      "feat = array(feat).T\n",
      "\n",
      "# plot the dataset using two highly correlated attributes\n",
      "c = ['r', 'g', 'b']\n",
      "for i in range(3):\n",
      "    plt.scatter(feat[2,lab==i],feat[3,lab==i],color=c[i],marker='x')\n",
      "    \n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we setup the model which is CART-tree in this case."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CARTree, PT_MULTICLASS\n",
      "\n",
      "# set attribute types - all continuous\n",
      "feature_types = array([False, False, False, False])\n",
      "\n",
      "# setup CART-tree with cross validation pruning switched off\n",
      "cart = CARTree(feature_types,PT_MULTICLASS,5,False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally we can use Shogun's cross-validation class to get performance."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import RealFeatures, MulticlassLabels\n",
      "from modshogun import CrossValidation, MulticlassAccuracy, CrossValidationSplitting, CrossValidationResult\n",
      "\n",
      "# training features\n",
      "feats_train = RealFeatures(feat)\n",
      "# training labels\n",
      "labels_train = MulticlassLabels(lab)\n",
      "\n",
      "# set evaluation criteria - multiclass accuracy\n",
      "accuracy = MulticlassAccuracy()\n",
      "\n",
      "# set splitting criteria - 10 fold cross-validation\n",
      "split = CrossValidationSplitting(labels_train,10)\n",
      "\n",
      "# set cross-validation parameters \n",
      "cross_val = CrossValidation(cart,feats_train,labels_train,split,accuracy,False)\n",
      "\n",
      "# run cross-validation multiple times - to get better estimate of accuracy\n",
      "cross_val.set_num_runs(10)\n",
      "\n",
      "# get cross validation result\n",
      "result = cross_val.evaluate()\n",
      "\n",
      "# print result\n",
      "print('Mean Accuracy : ' + str(CrossValidationResult.obtain_from_generic(result).mean))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We get a mean accuracy of about 0.93-0.94. This number essentially means that a CART-tree trained using this dataset is expected to classify Iris flowers, given their required attributes, with an accuracy of 93-94% in a real world scenario. The parameters required by Shogun's cross-validation class should be noted in the above code snippet. The class requires the model, training features, training labels, [splitting strategy](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CSplittingStrategy.html) and [evaluation method](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CEvaluation.html) to be specified. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Regression using real dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we evaluate CART-induced decision tree over the [Servo dataset](https://archive.ics.uci.edu/ml/datasets/Servo). Using this dataset, we essentially want to train a model  which can predict the rise time of a servomechanism given the required parameters which are the two (integer) gain settings and two (nominal) choices of mechanical linkages. Let us read the dataset first."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import array\n",
      "\n",
      "# dictionary to convert string features to integer values\n",
      "to_int = {'A' : 1, 'B' : 2, 'C' : 3, 'D' : 4, 'E' : 5}\n",
      "\n",
      "# read csv file and separate out labels and features\n",
      "lab = []\n",
      "feat = []\n",
      "with open('../../../../data/uci/servo/servo.data') as csvfile:\n",
      "    csvread = csv.reader(csvfile,delimiter=',')\n",
      "    for row in csvread:\n",
      "        feat.append([to_int[row[0]], to_int[row[1]], float(row[2]), float(row[3])])\n",
      "        lab.append(float(row[4]))\n",
      "\n",
      "lab = array(lab)\n",
      "feat = array(feat).T"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The servo dataset is a small training dataset (contains just 167 training vectors) with no separate test dataset, like the Iris dataset. Hence we will apply the same cross-validation strategy we applied in case of the Iris dataset. However, to make things interesting let us play around with a yet-untouched parameter of CART-induced tree i.e. the [maximum allowed tree depth](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCARTree.html#a8ae1a862da33e74ab9391b44dfff708f). As the tree depth increases, the tree becomes more complex and hence fits the training data more closely. By setting a maximum allowed tree depth, we restrict the complexity of trained tree and hence avoid over-fitting. But choosing a low value of the maximum allowed tree depth may lead to early stopping i.e. under-fitting. Let us explore how we can decide the appropriate value of the max-allowed-tree-depth parameter. Let us create a method, which takes max-allowed-tree-depth parameter as input and returns the corresponding cross-validated error as output."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CARTree, RegressionLabels, PT_REGRESSION, MeanSquaredError\n",
      "from modshogun import CrossValidation, CrossValidationSplitting, CrossValidationResult\n",
      "\n",
      "# form training features\n",
      "feats_train = RealFeatures(feat)\n",
      "# form training labels\n",
      "labels_train = RegressionLabels(lab)\n",
      "\n",
      "def get_cv_error(max_depth):\n",
      "    # set attribute types - 2 nominal and 2 ordinal\n",
      "    feature_types = array([True, True, False, False])\n",
      "    # setup CART-tree with cross validation pruning switched off\n",
      "    cart = CARTree(feature_types,PT_REGRESSION,5,False)\n",
      "    # set max allowed depth\n",
      "    cart.set_max_depth(max_depth)\n",
      "\n",
      "    # set evaluation criteria - mean squared error\n",
      "    accuracy = MeanSquaredError()\n",
      "    # set splitting criteria - 10 fold cross-validation\n",
      "    split = CrossValidationSplitting(labels_train,10)\n",
      "    # set cross-validation parameters \n",
      "    cross_val = CrossValidation(cart,feats_train,labels_train,split,accuracy,False)\n",
      "\n",
      "    # run cross-validation multiple times\n",
      "    cross_val.set_num_runs(10)\n",
      "\n",
      "    # return cross validation result\n",
      "    return CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, let us supply a range of ```max_depth``` values to the above method and plot the returned cross-validated errors."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "\n",
      "cv_errors = [get_cv_error(i) for i in range(1,15)]\n",
      "plt.plot(range(1,15),cv_errors,'bo',range(1,15),cv_errors,'k')\n",
      "plt.xlabel('max_allowed_depth')\n",
      "plt.ylabel('cross-validated error')\n",
      "plt.ylim(0,1.2)\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the above plot quite clearly gives us the most appropriate value of maximum allowed depth. We see that the first minima occurs at a maximum allowed depth of 6-8. Hence, one of these should be the desired value. It is to be noted that the error metric that we are discussing here is the mean squared error. Thus, from the above plot, we can also claim that, given required parameters, our CART-flavoured decision tree can predict the rise time within an average error range of $\\pm0.5$ (i.e. square root of 0.25 which is the approximate minimum cross-validated error). The relative error i.e  ```average_error/range_of_labels``` comes out to be ~30%."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "CHi-squared Automatic Interaction Detection (CHAID)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The CHAID is an algorithm for decision tree learning proposed by Kass (1980). It is similar in functionality to CART in the sense that both can be used for classification as well as regression. But unlike CART, CHAID internally handles only categorical features. The continuous features are first converted into ordinal categorical features for the CHAID algorithm to be able to use them. This conversion is done by binning of feature values.The number of bins (K) has to be supplied by the user. Given K, a predictor is split in such a way that all the bins get the same number (more or less) of distinct predictor values. The maximum feature value in each bin is used as a breakpoint.\n",
      "\n",
      "An important parameter in the CHAID tree growing process is the p-value. The p-value is the metric that is used for deciding which categories of predictor values to merge during merging as well as for deciding the best attribute during splitting. The p-value is calculated using different hypothesis testing methods depending on the type of dependent variable (nominal, ordinal or continuous). A more detailed discussion on the CHAID algorithm can be found in the [documentation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCHAIDTree.html) of the ```CCHAIDTree``` class in Shogun. Let us move on to a more interesting topic which is learning to use CHAID using Shogun's python API.  "
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Classification example using toy dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let us re-use the toy classification dataset used in C4.5 and CART to see the API usage of CHAID as well as to qualitatively compare the results of the CHAID algorithm with the other two."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_feats,train_labels,test_feats = create_toy_classification_dataset(20,True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we set up our CHAID-tree with appropriate parameters and train over given data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import PT_MULTICLASS, CHAIDTree\n",
      "from numpy import array, dtype, int32\n",
      "\n",
      "def train_chaidtree(dependent_var_type,feature_types,num_bins,features,labels):\n",
      "    # create CHAID tree object\n",
      "    c = CHAIDTree(dependent_var_type,feature_types,num_bins)\n",
      "    # set training labels\n",
      "    c.set_labels(labels)\n",
      "    # train using training features\n",
      "    c.train(features)\n",
      "    \n",
      "    return c\n",
      "\n",
      "# form feature types 0 for nominal (attribute X), 2 for continuous (attribute Y)\n",
      "ft = array([0, 2],dtype=int32)\n",
      "\n",
      "# cache training matrix\n",
      "train_feats_cache=RealFeatures(train_feats.get_feature_matrix())\n",
      "\n",
      "# get back trained tree - dependent variable type is nominal (hence 0), number of bins for binning is 10  \n",
      "chaid = train_chaidtree(0,ft,10,train_feats,train_labels)\n",
      "\n",
      "print('updated_matrix')\n",
      "print(train_feats.get_feature_matrix())\n",
      "print('')\n",
      "print('original_matrix')\n",
      "print(train_feats_cache.get_feature_matrix())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "An important point to be noted in the above code snippet is that CHAID training modifies the training data. The actual continuous feature values are replaced by the discrete ordinal values obtained during continuous to ordinal conversion. Notice the difference between the original feature matrix and the updated matrix. The updated matrix contains only 10 distinct values denoting all values of the original matrix for feature dimension at row index 1.\n",
      "\n",
      "With a CHAID-trained decision tree at our disposal, it's time to apply it to colour our test points. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get output labels\n",
      "output_labels = chaid.apply_multiclass(test_feats)\n",
      "\n",
      "plot_toy_classification_results(train_feats_cache,train_labels,test_feats,output_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Regression example with toy dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we re-work the sinusoid curve fitting example (earlier used in CART toy regression)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_feats,train_labels = create_toy_regression_dataset(300,0.5)\n",
      "plot_ref_sinusoid()\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As usual, we start by setting up our decision tree and training it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from numpy import dtype, int32, array\n",
      "\n",
      "# feature type - continuous\n",
      "feat_type = array([2],dtype=int32)\n",
      "\n",
      "# get back trained tree\n",
      "chaid = train_chaidtree(2,feat_type, 50, train_feats, train_labels)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we use the trained decision tree to follow the reference sinusoid."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "plot_predicted_sinusoid(chaid)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A distinguishing feature about the predicted curve is the presence of steps. These steps are essentially an artifact of continuous to ordinal conversion. If we decrease the number of bins for the conversion the step widths will increase."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Classification example over real dataset "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we will try to estimate the quality of wine based on 13 attributes like alcohol content, malic acid, magnesium content, etc. using the [wine dataset](https://archive.ics.uci.edu/ml/datasets/Wine). Let us first read the dataset using Shogun's [CSV file reader](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCSVFile.html)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CSVFile, RealFeatures, MulticlassLabels\n",
      "\n",
      "train_feats=RealFeatures(CSVFile('../../../../data/uci/wine/fm_wine.dat'))\n",
      "train_labels=MulticlassLabels(CSVFile('../../../../data/uci/wine/label_wine.dat'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Like the case of CART, here we are also interested in finding out the approximate accuracy with which our CHAID tree trained on this dataset will perform in real world. Hence, we will apply the cross validation strategy. But first we specify the parameters of the CHAID tree."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CHAIDTree, MulticlassLabels\n",
      "\n",
      "# set attribute types - all attributes are continuous(2)\n",
      "feature_types = array([2 for i in range(13)],dtype=int32)    \n",
      "\n",
      "# setup CHAID tree - dependent variable is nominal(0), feature types set, number of bins(20)\n",
      "chaid = CHAIDTree(0,feature_types,20)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we set up the cross-validation class and get back the error estimate we want i.e mean classification error."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# set up cross validation class\n",
      "\n",
      "from modshogun import CrossValidation, CrossValidationSplitting, CrossValidationResult, MulticlassAccuracy\n",
      "\n",
      "# set evaluation criteria - multiclass accuracy\n",
      "accuracy = MulticlassAccuracy()\n",
      "    \n",
      "# set splitting criteria - 10 fold cross-validation\n",
      "split = CrossValidationSplitting(train_labels,10)\n",
      "\n",
      "# set cross-validation parameters \n",
      "cross_val = CrossValidation(chaid,train_feats,train_labels,split,accuracy,False)\n",
      "\n",
      "# run cross-validation multiple times\n",
      "cross_val.set_num_runs(10)\n",
      "\n",
      "print('Mean classification accuracy : '+str(CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean*100)+' %')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Regression example using real dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we try to predict the value of houses in Boston using 13 attributes, like per capita crime rate in neighborhood, number of rooms, nitrous oxide concentration in air, proportion of non-retail business in the area etc. Out of the 13 attributes 12 are continuous and 1 (the Charles river dummy variable) is binary nominal. Let us load the dataset as our first step. For this, we can directly use Shogun's [CSV file reader class](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CCSVFile.html)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CSVFile, RealFeatures, RegressionLabels\n",
      "from numpy import ptp\n",
      "\n",
      "train_feats=RealFeatures(CSVFile('../../../../data/uci/housing/fm_housing.dat'))\n",
      "train_labels=RegressionLabels(CSVFile('../../../../data/uci/housing/housing_label.dat'))\n",
      "\n",
      "# print range of regression labels - this is useful for calculating relative deviation later \n",
      "print('labels range : '+str(ptp(train_labels.get_labels())))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, we set up the parameters for the CHAID tree as well as the cross-validation class."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from modshogun import CHAIDTree, MeanSquaredError\n",
      "from modshogun import CrossValidation, CrossValidationSplitting, CrossValidationResult\n",
      "from numpy import array, dtype, int32\n",
      "\n",
      "def get_cv_error(max_depth):\n",
      "    # set feature types - all continuous(2) except 4th column which is nominal(0) \n",
      "    feature_types = array([2]*13,dtype=int32)\n",
      "    feature_types[3]=0\n",
      "    feature_types[8]=1\n",
      "    feature_types[9]=1    \n",
      "\n",
      "    # setup CHAID-tree\n",
      "    chaid = CHAIDTree(2,feature_types,10)\n",
      "    # set max allowed depth\n",
      "    chaid.set_max_tree_depth(max_depth)\n",
      "\n",
      "    # set evaluation criteria - mean squared error\n",
      "    accuracy = MeanSquaredError()\n",
      "    # set splitting criteria - 5 fold cross-validation\n",
      "    split = CrossValidationSplitting(train_labels,5)\n",
      "    # set cross-validation parameters \n",
      "    cross_val = CrossValidation(chaid,train_feats,train_labels,split,accuracy,False)\n",
      "\n",
      "    # run cross-validation multiple times\n",
      "    cross_val.set_num_runs(3)\n",
      "\n",
      "    # return cross validation result\n",
      "    return CrossValidationResult.obtain_from_generic(cross_val.evaluate()).mean"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import matplotlib.pyplot as plt\n",
      "% matplotlib inline\n",
      "cv_errors = [get_cv_error(i) for i in range(1,10)]\n",
      "plt.plot(range(1,10),cv_errors,'bo',range(1,10),cv_errors,'k')\n",
      "plt.xlabel('max_allowed_depth')\n",
      "plt.ylabel('cross-validated error')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the above figure, we see that tree depth of 2-4 is most optimal and gives a mean squared error of ~25 which is a deviation of ~$\\pm5$. We already calculated the range of labels to be 45.0, hence the relative deviation comes out to be 11.11%"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "References"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science\n",
      "\n",
      "[2] Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1: 1 (Mar. 1986), 81-106"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}